Reducing query time of rainfall data using {duckdb} and {arrow} in R

Details

In this demonstration, we use gridded rainfall data from the National Oceanic and Atmospheric Administration (NOAA) to demonstrate the gain in run times and queries using local .duckdb files, the duckdb & arrow R packages.

Note

Some codes relating to connections and credentials have been removed owing to company policy. Only codes needed for demonstration are kept.
This document is prepared as a demonstration of work with database structure and spatial analysis for applying to Code With Us opportunity R Shiny interface for the climr package (downscaled monthly climate data)

All the results are real time

Procedure

The data is saved in two methods.

  • In Amazon AWS S3 buckets as .parquet files, partitioned by years.
    This is how the data is saved.

  • A local .duckdb file that contains the data. This is a single file.

Results

We choose a min_date and a max_date and use them to query data from the two storage methods mentioned above. We will use an R package tictoc to show the run-time differences.

Show the code
get_daily_data = function(conn, date_min, date_max, years_selected, months_selected){
    tictoc::tic()
  temp <- dplyr::tbl(conn, "daily_precip_data") |>
    #Need to have the Year Filter ==> because the data is partitioned with the Years, Otherwise it will get very slow.
    dplyr::filter(Year %in% years_selected) |>
    dplyr::filter(Month %in% months_selected) |>
    dplyr::filter(Date >= date_min, Date <= date_max) |>
    dplyr::group_by(GRIDCODE) |>
    dplyr::summarise(
      precip_sum = sum(Precip_daily) /10, #Keep in mind that we store everything as integer ==> need to divide by 10 to get data in mm!
      From = first(Date),
      To = last(Date)
    ) |>
    dplyr::collect()
  tictoc::toc()
  
  return(temp)
}

Query from S3 buckets .parquet files

query_from_db = get_daily_data(con_db, date_min, date_max, years_selected, months_selected)
174.502 sec elapsed
print(str(query_from_db))
tibble [14,129 × 4] (S3: tbl_df/tbl/data.frame)
 $ GRIDCODE  : int [1:14129] 26133 27033 29133 30933 32733 23434 24934 30934 31243 31843 ...
 $ precip_sum: num [1:14129] 21.6 74.5 197.7 328.3 355.8 ...
 $ From      : Date[1:14129], format: "2022-01-01" "2022-01-01" ...
 $ To        : Date[1:14129], format: "2022-01-31" "2022-01-31" ...
NULL

Query from local .duckdb file

query_from_local = get_daily_data(con_local, date_min, date_max, years_selected, months_selected)
0.133 sec elapsed
print(str(query_from_local))
tibble [14,129 × 4] (S3: tbl_df/tbl/data.frame)
 $ GRIDCODE  : int [1:14129] 33621 27022 26123 26423 27923 32123 26724 28824 29424 30624 ...
 $ precip_sum: num [1:14129] 0 0 0 213 87 ...
 $ From      : Date[1:14129], format: "2022-01-01" "2022-01-01" ...
 $ To        : Date[1:14129], format: "2022-01-31" "2022-01-31" ...
NULL
Note

Notice the run-time improvement with the local file implementation

A similar method may be used for Phase 2 - Database optimization of the project if duckdb is allowed. Otherwise, a similar method using PostGIS may be implemented

Map using mapgl

Finally, a map using {mapgl} to demonstrate my experience of working with climate data.

Show the code
tooltips = glue("<strong>{map_data$COUNTY_NAME}, {map_data$STATE_NAME}</strong> <br><strong>{map_data$precip_sum}mm</strong>")
map_data = cbind(map_data, tooltips)

colors = c("#fff", viridisLite::turbo(8, direction = -1)[4:8])

map = maplibre(style = carto_style("positron")) %>%
  add_line_layer(
    id = "border",
    source = us_states,
    line_color = "#aaa",
    line_cap = "round"
  ) %>% 
  fit_bounds(us_states, animate = FALSE, padding = 60) %>% 
  add_fill_layer(
    id           = "precipitation_data",
    source       = map_data,    # The joined grid plot with precip_sum values
    fill_color   = interpolate(
      column = "precip_sum",     # The column in your dataset to base the colors on
      values = c(0,1,40, 100, 200, 400),   # Precip_sum value breakpoints
      stops = colors,  # Colors for each breakpoint
      na_color = "transparent"   # Use transparent for NA values or 0 values
    ),
    tooltip = "tooltips",
    hover_options = list(
      fill_color = "black",
      fill_opacity = 1
    ),
    fill_opacity = 0.5  # Adjust opacity for non-zero values as needed
  ) |> 
  add_legend(
    "Cumulated rainfall(mm)",
    values = c(0,1,40, 100, 200, 400),
    colors = colors,
    position = "bottom-right",
    width = "325px"
  ) |>
  add_symbol_layer(
    id           = "state_labels",
    source       = us_states,
    text_field   = "{NAME}",   # Correctly reference the 'NAME' field for state names
    text_size    = 13,         # Set font size for labels
    text_color   = "black",
    text_halo_width = 1,       # Add halo around text for readability
    text_halo_color = "white"  # Set halo color
  )
      
map